[feature](function) Support murmur_hash3_128 function#63196
Conversation
### What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary: Add a 128-bit MurmurHash3 scalar function that returns LARGEINT. The function is implemented in BE, registered in FE Nereids scalar functions, and covered by BE hash function tests.
### Release note
Support murmur_hash3_128 function.
### Check List (For Author)
- Test: Unit Test
- ./run-fe-ut.sh --coverage --run org.apache.doris.nereids.rules.analysis.FunctionRegistryTest
- ./run-be-ut.sh --run '--filter=HashFunctionTest.*' attempted, but local worktree failed before compilation while downloading apache-orc submodule from GitHub.
- Behavior changed: Yes. Adds a new builtin scalar hash function.
- Does this need documentation: Yes
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-H: Total hot run time: 29748 ms |
TPC-DS: Total hot run time: 170678 ms |
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
### What problem does this PR solve? Issue Number: N/A Related PR: apache#63196 Problem Summary: Add regression coverage for the new MURMUR_HASH3_128 scalar function. The tests validate NULL propagation, empty string, single argument, multiple arguments, unicode input, and table column evaluation against deterministic 128-bit LARGEINT results. ### Release note None ### Check List (For Author) - Test: Regression test - Added coverage in test_hash_function.groovy. - Ran git diff --check. - Attempted ./run-regression-test.sh --run -d query_p0/sql_functions/hash_functions -s test_hash_function, but the local running cluster does not include MURMUR_HASH3_128 yet and failed with Can not found function 'MURMUR_HASH3_128'. - Behavior changed: No - Does this need documentation: No
|
run buildall |
TPC-H: Total hot run time: 29625 ms |
TPC-DS: Total hot run time: 170846 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
FE UT Coverage ReportIncrement line coverage |
### What problem does this PR solve? Issue Number: None Related PR: apache#63196 Problem Summary: Align murmur_hash3_128 BE implementation with the existing 128-bit MurmurHash3 primitive for the first argument, while keeping Doris variadic hash semantics for multiple arguments by carrying the intermediate (h1, h2) state between arguments. Also make the LARGEINT packing layout explicit without std::memcpy. ### Release note None ### Check List (For Author) - Test: Manual test - Ran git diff --cached --check for the changed BE files. - Tried ./run-be-ut.sh --run '--filter=HashFunctionTest.*', but local CMake configuration stopped before building because contrib/openblas could not find OpenMP_C. - Behavior changed: No - Does this need documentation: No
|
/review |
### What problem does this PR solve? Issue Number: None Related PR: apache#63196 Problem Summary: Add focused coverage for murmur_hash3_128 argument combinations. BE unit tests now exercise const and non-const argument combinations through check_function_all_arg_comb, and regression tests cover table-column multi-argument calls to validate variadic state propagation with column and constant arguments. ### Release note None ### Check List (For Author) - Test: Manual test - Ran build-support/clang-format.sh be/test/exprs/function/function_hash_test.cpp. - Ran git diff --check for the changed test files. - Tried ./run-be-ut.sh --run '--filter=HashFunctionTest.*', but local CMake configuration stopped before building because contrib/openblas could not find OpenMP_C. - Tried ./run-regression-test.sh --run -d query_p0/sql_functions/hash_functions -s test_hash_function, but the connected local cluster did not include MURMUR_HASH3_128 yet and failed with Can not found function 'MURMUR_HASH3_128'. - Behavior changed: No - Does this need documentation: No
There was a problem hiding this comment.
Automated review completed. I did not find any blocking correctness issue in the PR.
Checkpoint conclusions:
- Goal and proof: The PR adds
murmur_hash3_128returningLARGEINT; BE implementation, FE/Nereids registration, unit tests, and regression coverage are present. The submitted test commands were partly blocked in the author's environment; I verified the diff and code paths but did not run the full Doris test suite in this runner. - Scope and clarity: The change is focused on one new scalar function and follows the existing variadic hash function structure.
- Concurrency: No new shared mutable state, locks, threads, or lifecycle-sensitive async behavior are introduced.
- Lifecycle/static initialization: No new cross-TU static objects with initialization dependencies were found; the added constants/functions are local/simple.
- Configuration: No config items are added.
- Compatibility: No storage/editlog format change. FE/BE function metadata and BE registration are both added for the new symbol.
- Parallel paths: Existing 32/64-bit hash paths are not modified beyond registration context; the new function follows the existing hash path pattern.
- Conditional checks/error handling: Unsupported BE column types return
Status::NotSupported, consistent with neighboring code. No ignoredStatusin the new BE execution path. - Tests: Coverage includes BE unit cases and regression cases for NULL, empty string, single/multiple args, unicode input, and table-column evaluation. Regression assertions are deterministic.
- Test results: No generated
.outresult was added for the new assertions; values are compared explicitly in Groovy. - Observability: No additional logs/metrics appear necessary for this scalar function.
- Transactions/persistence/data writes: Not applicable.
- FE-BE variable passing: Not applicable beyond scalar function registration/signature exposure.
- Performance/memory: No obvious avoidable allocation or untracked large memory ownership was introduced; execution remains per-column and in-place over the result vector.
- User focus: No additional user-provided review focus was present.
Residual risk: Full BE/regression execution was not performed by me in this runner, so CI remains the source of truth for build/test validation.
|
run buildall |
TPC-H: Total hot run time: 30812 ms |
TPC-DS: Total hot run time: 169194 ms |
FE UT Coverage ReportIncrement line coverage |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
|
/review |
There was a problem hiding this comment.
I found one blocking issue in the new BE implementation. The function registration and FE/Nereids plumbing are focused and consistent with adjacent hash functions, and the added regression/BE tests cover ordinary nullable, constant, unicode, and multi-argument behavior. However, the new 128-bit hashing path introduces an unsafe length narrowing before the Murmur primitive, which can read out of bounds for large STRING values.
Critical checkpoint conclusions:
- Goal/test proof: The PR adds murmur_hash3_128 and tests normal single/multi-argument behavior, but misses the large-input boundary affected by the implementation.
- Scope/focus: The change is small and localized.
- Concurrency/lifecycle/config/compatibility: No new concurrency, lifecycle, config, storage format, or FE-BE protocol compatibility concerns found.
- Parallel paths: FE and BE registration paths for the new function are present; the issue is specific to the new 128-bit implementation path.
- Error handling/memory safety: Blocking issue found: size_t to int truncation can make murmur_hash3_x64_process operate on a negative length.
- Test coverage: Good normal-path coverage, but missing the boundary/negative test for lengths beyond int range.
- Observability/transactions/data writes: Not applicable.
- Performance: No additional performance issue found.
User focus: No additional user-provided review focus was specified.
| uint64_t hash[2] = {0, 0}; | ||
| // The first SQL argument starts from seed 0, so it can use the existing 128-bit primitive | ||
| // directly. Later arguments must use update_hash() to continue from the saved (h1, h2). | ||
| murmur_hash3_x64_128(data, static_cast<int>(size), 0, hash); |
There was a problem hiding this comment.
This narrows a SQL string length from size_t to int before entering murmur_hash3_x64_128 / murmur_hash3_x64_process. Doris string offsets are not limited to INT_MAX (StringLengthType is uint32_t), so a large STRING value can wrap to a negative len; murmur_hash3_x64_process then computes a negative nblocks and derives tail = data + nblocks * 16, which can point before the buffer and read out of bounds. Please either make the 128-bit Murmur primitive accept a non-narrowing length type (as other hash wrappers do at their public boundary) or reject/guard values larger than INT_MAX before calling it.
| qt_mmh3_u64_v2_fold_1 "SELECT MURMUR_HASH3_U64_V2('test') + 1;" | ||
| qt_mmh3_u64_v2_fold_2 "SELECT MURMUR_HASH3_U64_V2('a', 'b') * 2;" | ||
|
|
||
| def validate_mmh3_128 = { String expected, String expression -> |
There was a problem hiding this comment.
dont use this mode. directly use qt_sql and output the result to .out files
| struct MurmurHash3128Impl { | ||
| static constexpr auto name = "murmur_hash3_128"; | ||
|
|
||
| static Status empty_apply(IColumn& icolumn, size_t input_rows_count) { |
There was a problem hiding this comment.
At least 1 arg? then should return error here
| assertEquals(expected, res[0][0] == null ? null : res[0][0].toString()); | ||
| } | ||
|
|
||
| validate_mmh3_128(null, "NULL"); |
There was a problem hiding this comment.
should also test for some more cases, like zero arg, other type inputs, many nulls and so on
| // low 64 bits and h2 in the high 64 bits to match murmur_hash3_x64_128's out[0]/out[1]. | ||
| const auto value = | ||
| (static_cast<unsigned __int128>(h2) << 64) | static_cast<unsigned __int128>(h1); | ||
| return static_cast<__int128_t>(value); |
What problem does this PR solve?
Issue Number: N/A
Related PR: N/A
Problem Summary:
This PR adds a builtin 128-bit MurmurHash3 scalar function,
murmur_hash3_128, for callers that need a wider hash value than the existing 32-bit and 64-bit variants.be/src/exprs/function/function_hash.cppmurmur_hash3_128, returning LARGEINT.be/test/exprs/function/function_hash_test.cppBuiltinScalarFunctions.javaMurmurHash3128.javaScalarFunctionVisitor.javaDesign rationale: the function reuses the existing MurmurHash3 x64 128-bit processing path in BE and exposes the packed 128-bit result as LARGEINT, matching Doris' existing signed 128-bit integer representation.
BE execution logic:
murmur_hash3_128follows the same variadic hash execution model as the existing Doris hash functions. The BE function framework does not evaluate all arguments for one row in a single call. Instead,FunctionVariadicArgumentsBaseinvokes the implementation once per argument column:For a query such as:
the first round processes the whole
k1column withexecute<true>(). It creates one LARGEINT result slot per row and initializes each row's MurmurHash3 128-bit state from seed0. The second round processes the constant argument'world'withexecute<false>(). It reads each row's previous state from the result column, updates that state with the new argument bytes, and writes the packed state back.For example, if
k1has two rows:the state evolves as:
The underlying MurmurHash3 128-bit state is the pair
(h1, h2). For a single argument, the implementation can call the existingmurmur_hash3_x64_128(data, len, 0, out)directly. For multiple arguments, each later argument must continue from the previous(h1, h2)state; callingmurmur_hash3_x64_128independently for every argument would restart from seed0and lose the effect of earlier arguments.Because the SQL return type is LARGEINT and the BE result column stores one
__int128_tvalue per row, the implementation packs(h1, h2)into the result column between argument rounds and unpacks it before processing the next argument:This state storage is required by Doris' vectorized, per-argument-column execution model. It is not a requirement of the MurmurHash3 algorithm itself.
Release note
Support
murmur_hash3_128function.Check List (For Author)
Unit tests run locally:
./run-fe-ut.sh --coverage --run org.apache.doris.nereids.rules.analysis.FunctionRegistryTestpassed../run-be-ut.sh --run '--filter=HashFunctionTest.*'was attempted, but the local worktree failed before compilation while downloading theapache-orcsubmodule from GitHub.Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)